Author

Ella Addink

Code
library(vegabrite)
library(dplyr)
library(tidyr)
library(jsonlite)
library(readxl)
library(lubridate)
library(mapview)
library(sf)
library(geojson)
library(geojsonsf)

Exercise 1 - Twin Genetics

a. An example with something I like

I like Daniel’s use of points for his graphic for comparing twins. I think the points make the whole graphic easier to look at and a person can perceive where there are differences between twins. Although he does not differentiate between kits, I like how there are not facets or a selector and I can look at the data all at once.

b. An example with something I don’t like

I don’t love Sucry’s graphic for comparing kits because genetic share seems to be treated as an ordinal categorical variable, and I am not sure of his reasoning for doing so. I also do not think it compares the kits well, because I cannot tell what the graphic would look like if the kits all reported the same DNA proportions.

c-d. Graphics and stories

Code
genetics_wide <- read.csv("https://calvin-data304.netlify.app/data/twins-genetics-wide.csv")
genetics_long <- read.csv("https://calvin-data304.netlify.app/data/twins-genetics-long.csv")

Comparing twins

Code
vl_chart() |>
  vl_mark_line() |>
  vl_encode_x("id:N", title = "Twin") |>
  vl_encode_y("genetic.share:Q", title = "Genetic Share") |>
  vl_encode_color("pair:N") |>
  vl_legend_color(title = "Pair") |>
  vl_encode_row("kit:N", title = "Kit") |>
  vl_encode_column("region:N", title = "Region") |>
  vl_add_data(genetics_long) |>
  vl_add_properties(height = 150, width = 80,
                    title = list(text = "Large Differences in Twin Genetic Shares for Europe and West Africa Regions",
                                 dx = 80))

One would expect identical twins to receive the same results from DNA kits, but this is often not the case. For the 6 pairs of twins included here, we can see distinct differences in some of their measured genetic shares for the NW Europe, SE Europe, and West Africa regions. The largest difference between a pair of twins was for the SE Europe genetic share using the MyHeritage DNA kit, with a difference of almost 0.1. Interestingly, most of the larger differences occur for one set of twins (pair 1), in the only two regions where they have a genetic share (NW Europe and SE Europe) and with both the MyHeritage and 23andMe DNA kits.
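
The within-pair gaps described above can be checked numerically. The following is a minimal sketch, assuming `genetics_long` has one row per twin with the columns `pair`, `kit`, `region`, and `genetic.share` used in the chart code. The helper name `pair_differences` and the toy values are hypothetical and are not the real twins data.

```r
library(dplyr)

# Hypothetical helper: absolute gap between the two twins of each pair,
# computed separately for every kit and region.
# Assumes columns pair, kit, region, genetic.share (one row per twin).
pair_differences <- function(df) {
  df |>
    group_by(pair, kit, region) |>
    summarise(difference = max(genetic.share) - min(genetic.share),
              .groups = "drop") |>
    arrange(desc(difference))
}

# Toy values for illustration only (not the real twins data)
toy <- data.frame(
  pair = c(1, 1, 2, 2),
  kit = "MyHeritage",
  region = "SE Europe",
  genetic.share = c(0.50, 0.41, 0.30, 0.30)
)

toy_differences <- pair_differences(toy)
toy_differences
```

Calling `pair_differences(genetics_long)` would then list the largest twin-to-twin gaps first, if the column assumptions hold.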

Comparing DNA kits

Code
vl_chart() |>
  vl_mark_line() |>
  vl_encode_x("kit:N", title = "Kit") |>
  vl_encode_y("genetic.share:Q", title = "Genetic Share") |>
  vl_axis_y(tickCount = 6) |>
  vl_encode_color("twin:N") |>
  vl_scale_color(scheme = "paired") |>
  vl_legend_color(title = "Twin") |>
  vl_encode_detail("twin:N") |>
  vl_encode_facet("region:N", columns = 3, title = "Region") |> 
  vl_add_properties(width = 200, height = 120,
                    title = list(text = "Ancestry DNA Kit Results Vary Distinctly From 23andMe and MyHeritage",
                                 dx = 130)) |> 
  vl_add_data(genetics_long)

When comparing the results of three different DNA kits for six pairs of twins, there are distinct differences between the genetic shares reported by the kits. For four out of the five genetic share regions included in the kits, there is at least one pair of twins where the Ancestry kit has very different results from the other two.
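
One way to quantify how far the Ancestry results sit from the other two kits is to spread the kits into columns and subtract. This is only a sketch: it assumes the `kit` column contains exactly the values `"Ancestry"`, `"23andMe"`, and `"MyHeritage"` (the real labels may differ), and the helper name `kit_gaps` and the toy values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: one row per twin and region, one column per kit,
# plus the gap between Ancestry and the mean of the other two kits.
# Assumes kit values "Ancestry", "23andMe", "MyHeritage" (an assumption).
kit_gaps <- function(df) {
  df |>
    pivot_wider(id_cols = c(twin, region),
                names_from = kit,
                values_from = genetic.share) |>
    mutate(ancestry_gap = Ancestry - (`23andMe` + MyHeritage) / 2)
}

# Toy values for illustration only (not the real twins data)
toy_kits <- data.frame(
  twin = "1a",
  region = "NW Europe",
  kit = c("Ancestry", "23andMe", "MyHeritage"),
  genetic.share = c(0.60, 0.40, 0.44)
)

toy_gaps <- kit_gaps(toy_kits)
toy_gaps
```

Sorting the result by the absolute value of `ancestry_gap` would point to the twins and regions where Ancestry disagrees most.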

Exercise 2 - Data & Graphic Challenge

I chose challenge #2, re-making the customer touchpoints graphic.

Issues I noticed with this graphic:

  • Both the x and y axes could be improved. It is not easy to see which bar goes with each of the date labels. The labels also do not seem to be correct as 2018-08 comes after 2019-03. On the y axis, the number of significant digits is not consistent.

  • The emphasis on the total touchpoints as seen in the title and labels on the graphic is not the focus of the graphic itself and cannot really be seen with the data given. We have information on the total number of touchpoints per customer each month, but not the total number of customers, so we cannot actually calculate/graph the total touchpoints. Therefore, with this data, the total touchpoints should probably not be the main message and if included, should probably just be in the caption or subtitle.

  • The title and touchpoint labels are also currently quite large and distracting.

  • Stacked bars are probably not the best option because it is hard to compare the amounts of different types of touchpoints within one month (although maybe this is not as important to the story). It is also hard to see the trends for chat and email touchpoints because they don’t have a common baseline.

Code
touchpoints_wide <- read.csv("https://calvin-data304.netlify.app/data/swd-lets-practice-ex-5-03.csv") |> 
  mutate(Total = Phone.Touchpoints + Chat.Touchpoints + Email.Touchpoints,
         Complete_Date_str = paste0(Date, "-01"),
         Complete_Date = ymd(Complete_Date_str)) |> 
  rename(Phone = Phone.Touchpoints,
         Chat = Chat.Touchpoints,
         Email = Email.Touchpoints)

touchpoints_long <- touchpoints_wide |> 
  pivot_longer(Phone:Total, names_to = "Type", values_to = "Touchpoints")

touchpoints_label_data <- touchpoints_long |> 
  filter(Date == "2020-01")
Code
base <- vl_chart() |>
  vl_encode_x("Complete_Date:T", title = "") |>
  vl_axis_x(format = "%b %Y") |>
  vl_encode_y("Touchpoints:Q", title = "Touchpoints per Customer") |>
  vl_encode_color("Type:N", legend = FALSE) |>
  vl_scale_color(domain = c("Chat", "Phone", "Email", "Total"),
                 range = c("#1b9e77", "#7570b3", "#d95f02", "black")) 
Warning: Invalid schema for object passed to or created by
modify_inner_spec.vegaspec_unit
Code
lines <- base |> 
  vl_mark_line() |>
  vl_add_data(touchpoints_long)

labels <- base |> 
  vl_mark_text(dx = 18) |> 
  vl_encode_text("Type:N") |>
  vl_add_data(touchpoints_label_data)

vl_layer(lines, labels) |> 
  vl_add_properties(width = 400, height = 200)
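
The wrangling for this graphic builds a date by pasting `"-01"` onto the month string and picks the label rows with a hard-coded `"2020-01"`. A possible alternative, sketched here on toy values rather than the real touchpoints file, is to parse the year-month strings with `lubridate::ym()` and take the label rows from whatever the latest month happens to be. The data frame and column values below are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy values for illustration only (not the real touchpoints data)
toy_touchpoints <- data.frame(
  Date = c("2019-12", "2020-01"),
  Phone = c(0.30, 0.25),
  Chat = c(0.10, 0.15)
)

toy_long <- toy_touchpoints |>
  # ym() parses "YYYY-MM" strings to the first day of that month
  mutate(Complete_Date = ym(Date)) |>
  # naming the columns explicitly avoids relying on their order in the file
  pivot_longer(c(Phone, Chat), names_to = "Type", values_to = "Touchpoints")

# Label rows: one per type, taken at the most recent month in the data
toy_labels <- toy_long |>
  filter(Complete_Date == max(Complete_Date))

toy_labels
```

With this approach the end-of-line labels would follow the data if more months were added later.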

Exercise 3 - Tanzania Graphic

My data: Tanzania data

Code
tanzania <- read_excel('Tanzania_data.xlsx') |>
  mutate(contraception_use = contraception_use / 100, 
         fam_plan_unmet = fam_plan_unmet / 100)
Code
points_base <- vl_chart() |> 
  vl_encode_x("date:T", title = "") |>
  vl_mark_point(strokeWidth = 1) |> 
  vl_encode_fill(value = "lightgrey") |> 
  vl_encode_opacity(value = 1) |>
  vl_encode_color(value = "black") |> 
  vl_encode_size("fertility_rate:Q", 
                 title = "Total Fertility Rate",
                 legend = list(orient = "bottom")) |> 
  vl_scale_size(domainMin = 5, domainMax = 7, rangeMax = 170)

contraception_points <- points_base |>
  vl_encode_y("contraception_use:Q") |> 
  vl_axis_y(format = "%")

fam_plan_points <- points_base |> 
  vl_encode_y("fam_plan_unmet:Q")

contraception_line <- vl_chart() |> 
  vl_mark_line() |> 
  vl_encode_x("date:T", title = "") |> 
  vl_encode_y("contraception_use:Q", title = "") |>
  vl_encode_color(value = "#7570b3")
  
fam_plan_line <- vl_chart() |> 
  vl_mark_line() |> 
  vl_encode_x("date:T", title = "") |> 
  vl_encode_y("fam_plan_unmet:Q", title = "") |> 
  vl_encode_color(value = "#1b9e77") 

contraception_label <- vl_chart() |> 
  vl_mark_text(dx = 260, size = 12) |> 
  vl_encode_text(value = "Contraception Use") |> 
  vl_encode_y(datum = 0.384) |> 
  vl_encode_color(value = "#7570b3")

fam_plan_label <- vl_chart() |>
  vl_mark_text(dx = 288, size = 12) |>
  vl_encode_text(value = "Unmet Family Planning Need") |>
  vl_encode_y(datum = 0.221) |> 
  vl_encode_color(value = "#1b9e77")

vl_layer(fam_plan_line, contraception_line, 
         fam_plan_points, contraception_points, 
         contraception_label, fam_plan_label) |> 
  vl_add_data(tanzania) |>
  vl_encode_tooltip_array(c("date:T", "contraception_use:Q", "fam_plan_unmet:Q", "fertility_rate:Q")) |>
  vl_add_properties(width = 400, height = 300, 
                    title = "Contraception Use and Family Planning Increase in Tanzania")

Over the last few decades, the use of modern contraception in Tanzania has steadily increased with the largest increase occurring after 2010. The unmet need for family planning has decreased slightly since 1992 by about 5% total. Potentially associated with both of these is the decrease in the total fertility rate for all women ages 15-49.
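
Because the rates are stored as proportions (the percentages were divided by 100 when the data was loaded), a change such as "about 5%" is easiest to read as a change in percentage points between the first and last survey. The sketch below shows that calculation on made-up values. The helper name `change_in_points` is hypothetical, and the numbers are not the real Tanzania data.

```r
# Hypothetical helper: change between the earliest and latest observation,
# expressed in percentage points. Assumes a date column named `date` and a
# column of proportions (0 to 1), as in the tanzania data frame above.
change_in_points <- function(df, column) {
  first <- df[[column]][which.min(df$date)]
  last <- df[[column]][which.max(df$date)]
  (last - first) * 100
}

# Toy values for illustration only (not the real Tanzania data)
toy_tanzania <- data.frame(
  date = as.Date(c("1992-01-01", "2010-01-01", "2022-01-01")),
  fam_plan_unmet = c(0.30, 0.27, 0.25)
)

toy_change <- change_in_points(toy_tanzania, "fam_plan_unmet")
toy_change
```

A negative result means the rate fell over the period. The same helper could be applied to `contraception_use` to compare the two trends on the same scale.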

Exercise 4 - Masterpiece

My data:

National Park Visits Data from a query builder found here link

National Park Locations Shapefile from here link

National Park Locations CSV

Code
# Loading in data
us_map_url <- "https://cdn.jsdelivr.net/npm/vega-datasets@2.11.0/data/us-10m.json"
park_visits <- read_excel("National_Park_Data.xlsx",
                          skip = 2)
park_shp <- read_sf("nps_boundary_centroids.shp")
# Then I used st_write(park_shp, "parks.csv", layer_options = "GEOMETRY=AS_XY") in the console to create parks.csv
parks <- read.csv("parks.csv")
Code
# Cleaning and joining data
final_parks_data <- parks |> 
  left_join(park_visits, join_by(UNIT_CODE == `Unit Code`),
            relationship = "many-to-many") |> 
  filter(!is.na(`Recreation Visits`)) |>
  select(X, Y, UNIT_CODE, UNIT_NAME, STATE, REGION, UNIT_TYPE, Year, `Recreation Visits`) |> 
  filter(STATE != "AS" & STATE != "DC" & STATE != "GU" & STATE != "MP" & STATE != "PR" & 
         STATE != "TT" & STATE != "VI") |> 
  rename(LAT = Y,
         LONG = X,
         Visits = `Recreation Visits`) |> 
  filter(UNIT_TYPE != "International Historic Site" & UNIT_TYPE != "Other Designation") |> 
  mutate(Type = case_when(
    UNIT_TYPE %in% c("National River", "National Lakeshore",
                     "National Seashore", "National Wild & Scenic River") ~
      "River, Lakeshore, Seashore",
    UNIT_TYPE %in% c("National Historic Site", "National Historical Park") ~
      "Historic Site or Park",
    UNIT_TYPE == "National Scenic Trail" ~ "Scenic Trail",
    UNIT_TYPE %in% c("National Battlefield", "National Battlefield Park",
                     "National Battlefield Site", "National Memorial",
                     "National Military Park") ~
      "Memorial, Military Park, Battlefield",
    UNIT_TYPE %in% c("National Preserve", "National Recreation Area",
                     "National Reserve") ~
      "Reserve, Preserve, Recreation Area",
    UNIT_TYPE == "National Monument" ~ "Monument",
    TRUE ~ UNIT_TYPE))

top_10_parks <- final_parks_data |> 
  filter(Year == 2024) |> 
  filter(Visits > 4800000)
Code
state_map <-
  vl_chart() |>
  vl_mark_geoshape(fill = "transparent", stroke = "grey", strokeWidth = 0.5) |>
  vl_add_data(
    url = us_map_url, 
    format = list(type = "topojson", feature = "states"))

park_points <- 
  vl_chart() |> 
  vl_filter("datum.Year == 2024") |>
  vl_mark_point(strokeWidth = 0.3  ) |> 
  vl_encode(longitude = "LONG:Q", latitude = "LAT:Q") |>
  vl_encode_color(value = "black") |> 
  vl_encode_fill("Type:N") |>
  vl_encode_size("Visits:Q",
                 legend = list(direction = "vertical")) |>
  vl_scale_size(bins = c(100, 100000, 5000000, 12000000, 18000000),
                rangeMin = 20, rangeMax = 300) |>
  vl_encode_opacity(value = 0.7) |>
  vl_add_data(final_parks_data) |>
  vl_encode_tooltip_array(c("UNIT_NAME", "Visits", "Type", "STATE")) 

map <- vl_layer(state_map, park_points) |>
  vl_add_properties(projection = list(type = "albersUsa"),
                    width = 450,
                    height = 300,
                    title = "Recreational Visits to National Park Service Lands in 2024") 

bar_graph <- 
vl_chart() |> 
  vl_mark_bar() |> 
  vl_encode_x("Visits:Q") |>
  vl_encode_y("UNIT_NAME:N",
              title = "",
              sort = list(field = "Visits", order = "descending")) |>
  vl_encode_fill("Type:N") |>
  vl_add_data(top_10_parks) |>
  vl_encode_tooltip_array(c("UNIT_NAME", "Visits", "Type", "STATE")) |>
  vl_add_properties(height = 200, width = 200, 
                    title = list(text = "Top 10 Most Visited",
                                 fontWeight = 300))

vl_vconcat(map, bar_graph) |>
  vl_add_properties(align = "center")

I chose to encode the type of national park land by color. I used the default color scheme, which does well at providing distinct colors for the qualitative data. I wrangled the data quite a bit to combine the many original types into a smaller number of groups, and I tried to keep their names as short and simple as possible. This hopefully makes the legend easier for the viewer to understand. I chose to encode the number of visits by the size of each point. Because judging the areas of circles is hard for the human eye, I limited the number of size groups and binned them so there are only four distinct sizes of points. I also tried to make sure the two layers of the map worked well together by decreasing the width of the state border lines and decreasing the opacity of the points so more can be seen when they overlap. I added a tooltip to both the map and the bar graph so the viewer can see the specific information.

I decided to concatenate the map with a bar graph of the top 10 visited locations to highlight these for the viewer. I ordered the bars from the most to the fewest visits because that is more meaningful to the viewer than alphabetical order. I also colored the bars in the same way the points are colored on the map to stay consistent and help connect them. I considered many options for this second graph below the map. What I really wanted was a line graph of visitation over time for the location the viewer chose by clicking on a point on the map, but I did not give myself enough time. I also considered adding a slider for the year so a person could see the change in visitation over time on both the map and the bar graph. I could not figure out how to line up the bar graph and map when they are concatenated.
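
The top 10 locations in the code above are picked with a fixed cutoff (`Visits > 4800000`), which only returns ten rows for this particular year. A sketch of selecting by rank instead, using `dplyr::slice_max()`, is shown below. The helper name `top_parks` and the toy values are hypothetical; the `Year`, `Visits`, and `UNIT_NAME` columns follow the names used in `final_parks_data`.

```r
library(dplyr)

# Hypothetical helper: the n most visited locations in a given year,
# chosen by rank rather than by a hard-coded visit threshold.
top_parks <- function(df, year = 2024, n = 10) {
  df |>
    filter(Year == year) |>
    slice_max(Visits, n = n, with_ties = FALSE)
}

# Toy values for illustration only (not the real visitation data)
toy_parks <- data.frame(
  UNIT_NAME = paste("Park", LETTERS[1:12]),
  Year = 2024,
  Visits = seq(1e6, 12e6, by = 1e6)
)

toy_top <- top_parks(toy_parks)
toy_top
```

`slice_max()` returns rows ordered from most to fewest visits, which matches the ordering chosen for the bars.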

Exercise 5 - Using features

Where I used these features:

  • An encoding channel other than x or y: Masterpiece

  • Layers: Tanzania Graphic

  • Facets: Genetics Graphics

  • Concatenation or repeat: Masterpiece

  • Non-default settings for a channel’s scale or guide: Tanzania Graphic

  • Tooltips: Tanzania Graphic

  • Another kind of interaction (panning/zooming, brushing, sliders, etc.):

Exercise 6 - Learning

a. Name 2 or 3 examples of where I used a feature we did not discuss in class (a new kind of mark, transform, a way to customize something, a way to use an interaction, etc)

  • Sadly, I am not sure I used any new Vega-Lite features that we did not discuss in class. However, I did figure out how to convert a shapefile to csv format and then save it as a file for my national park graphic. I also used some data wrangling functions in R that were not discussed in class.

b. Name 2 or 3 examples of where I followed the advice of Wilke or Knaflic

  • Wilke advises in section 6.1 to make sure your bars have a meaningful order to them when making bar graphs. I therefore ordered my bars in my “masterpiece” graphic from greatest to least visits rather than leaving them with their default order which was most likely alphabetical.

  • Knaflic advises on page 96 to label data directly when you can to de-clutter your graphic. Therefore, for my Tanzania graphic and graphic challenge, I labeled the lines on my graphs directly by adding additional text labels. I also followed Knaflic’s advice on page 97 to give the labels consistent colors which is described as a way of “leveraging the Gestalt principles of similarity.”